Which classification algorithm works best with stylistic features of Portuguese in order to classify web texts according to users’ needs?

نویسندگان

  • Rachel Aires
  • Aline Manfrin
  • Sandra Aluísio
  • Diana Santos
چکیده

In order to improve Web Information Retrieval, we have, in a previous work (Aires et al., 2004), investigated the use of stylistic features of Web texts in Portuguese to classify web pages according to users’ needs, using in most of the experiments the classification algorithm J48 (the Weka implementation of C4.5). From that study, we concluded that it was possible to identify some of the categories reliably, but we should investigate whether it was possible to get even better classification schemes using other algorithms. Language is a different domain, and the fact that C4.5 has been used successfully in other applications (even others dealing with written language) does not imply that it is also the best solution for our problem. In this paper, we document the replication of the experiments presented in Aires et al (2004), using all relevant Weka algorithms, also providing more information on the linguistic features used and on the issues concerning algorithm choice.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

What is my Style? Using Stylistic Features of Portuguese Web Texts to Classify Web Pages According to Users' Needs

In this paper we investigate the use of stylistic features of Web texts in Portuguese to classify web pages according to users’ needs, in order to improve Web Information Retrieval. We first describe a seven categories classification of users ́ needs, which was the outcome of a qualitative analysis of two TodoBr logs (a major Brazilian search engine). We describe 46 shallow linguistic features, ...

متن کامل

MHSubLex: Using Metaheuristic Methods for Subjectivity Classification of Microblogs

In Web 2.0, people are free to share their experiences, views, and opinions. One of the problems that arises in web 2.0 is the sentiment analysis of texts produced by users in outlets such as Twitter. One of main the tasks of sentiment analysis is subjectivity classification. Our aim is to classify the subjectivity of Tweets. To this end, we create subjectivity lexicons in which the words into ...

متن کامل

An Improvement in Support Vector Machines Algorithm with Imperialism Competitive Algorithm for Text Documents Classification

Due to the exponential growth of electronic texts, their organization and management requires a tool to provide information and data in search of users in the shortest possible time. Thus, classification methods have become very important in recent years. In natural language processing and especially text processing, one of the most basic tasks is automatic text classification. Moreover, text ...

متن کامل

Document Analysis And Classification Based On Passing Window

In this paper we present Document analysis and classification system to segment and classify contents of Arabic document images. This system includes preprocessing, document segmentation, feature extraction and document classification. A document image is enhanced in the preprocessing by removing noise, binarization, and detecting and correcting image skew. In document segmentation, an algorith...

متن کامل

Stylistic Changes for Temporal Text Classification

This paper investigates stylistic changes in a set of Portuguese historical texts ranging from the 17 to the early 20 century and presents a supervised method to classify them per century. Four stylistic features – average sentence length (ASL), average word length (AWL), lexical density (LD), and lexical richness (LR) – were automatically extracted for each sub-corpus. The initial analysis of ...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2004